Measurement challenges and causes of incomplete results reporting of biomedical animal studies: Results from an interview study

Background Existing evidence indicates that a significant amount of biomedical research involving animals remains unpublished. At the same time, we lack standards for measuring the extent of results reporting in animal research. Publication rates may vary significantly depending on the level of measurement such as an entire animal study, individual experiments within a study, or the number of animals used.
Methods Drawing on semi-structured interviews with 18 experts and qualitative content analysis, we investigated challenges and opportunities for the measurement of incomplete reporting of biomedical animal research with specific reference to the German situation. We further investigated causes of incomplete reporting.
Results The in-depth expert interviews revealed several reasons why incomplete reporting in animal research is difficult to measure at all levels under the current circumstances. While precise quantification based on regulatory approval documentation is feasible at the level of entire studies, measuring incomplete reporting at the more granular experiment and animal levels presents formidable challenges. The expert interviews further identified six drivers of incomplete reporting of results in animal research. Four of these are well documented in other fields of research: a lack of incentives to report non-positive results, pressures to ‘deliver’ positive results, perceptions that some data do not add value, and commercial pressures. The fifth driver, reputational concerns, appears to be far more salient in animal research than in human clinical trials. The final driver, socio-political pressures, may be unique to the field.
Discussion Stakeholders in animal research should collaborate to develop a clear conceptualisation of complete reporting in animal research, facilitate valid measurements of the phenomenon, and develop incentives and rewards to overcome the causes of incomplete reporting.


Introduction
The issue of incomplete reporting, when study outcomes are reported only partially or not at all, has attracted growing attention across numerous fields of natural and social science, raising questions about the rigour, efficiency, ethics and integrity of the scientific process, and the reliability, replicability and robustness of research findings [1][2][3].
In biomedical research, incomplete reporting and its effects on publication bias are well documented for human clinical trials [4,5], but far less so for animal studies [6][7][8][9]. The lack of information about publication rates in animal research could be partly explained by practical barriers and conceptual uncertainties in the empirical measurement of publication rates. In principle, the measurement of incomplete reporting of results in biomedical research is facilitated by the requirement for researchers to obtain advance approval for studies involving humans or live animals. To obtain approval, researchers must pre-specify all planned experiments within a study and the number of 'participants' in each experiment. Reporting gaps in clinical trials have repeatedly been quantified by using ethics committee approvals or funder cohorts to establish a cohort of all studies conducted, and then searching the literature for their published outcomes [10][11][12]. In European Union member states, legally mandated approvals by official bodies could be used to establish similar cohorts of animal studies.
Two separate groups recently used this approach to quantify non-publication of animal research at the level of entire animal studies and at the level of approved animal numbers. One group, which included several authors of this paper, found that of 158 approved studies at two German university medical centres that had verifiably been initiated, 33% had published outcomes neither in the scientific literature nor within doctoral theses [13]. Another group found that of 67 approved studies at a Dutch university, 40% did not result in publications [14]. But does the publication rate at the level of approved animal studies have sufficient construct validity? This question is important because an entire animal study might comprise several experiments, each involving different animals. A 2011 survey among 454 Dutch animal researchers asked for the publication rate at the experiment level. The surveyed researchers estimated that about 50% of all experiments remain unpublished [15]. Furthermore, the above-mentioned follow-up study at a Dutch university also assessed the publication rate at the level of animal numbers mentioned in the 67 approved study applications and found that journal articles did not report outcomes for 74% of the mentioned animals.
Which of these three different concepts of non-publication has the highest construct validity: the 33-40% at the study level [13,14], the 50% at the experiment level [15], or the 74% at the animal level [14]? Furthermore, how good are the internal and external validity of these three types of measures? Because reporting on the number of animals used in specific experiments is limited both in approval documents [16] and in journal publications [6], the measurement of reporting rates at all three levels might face substantial challenges.
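To illustrate why the three levels can yield such different figures, the following minimal sketch computes non-publication rates for one and the same cohort at the study, experiment and animal level. It is not part of the cited studies' methods: the cohort, the class names and the function name are hypothetical, and the sketch assumes for simplicity that an experiment is either reported for all of its animals or not reported at all.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    animals_used: int
    reported: bool  # assumption: reported for all animals, or not at all


@dataclass
class Study:
    experiments: list  # list of Experiment


def non_publication_rates(studies):
    """Share of unreported units at study, experiment and animal level."""
    experiments = [e for s in studies for e in s.experiments]
    # A study counts as published if ANY of its experiments was reported.
    unpublished_studies = sum(
        1 for s in studies if not any(e.reported for e in s.experiments)
    )
    unreported_experiments = sum(1 for e in experiments if not e.reported)
    total_animals = sum(e.animals_used for e in experiments)
    unreported_animals = sum(e.animals_used for e in experiments if not e.reported)
    return {
        "study": unpublished_studies / len(studies),
        "experiment": unreported_experiments / len(experiments),
        "animal": unreported_animals / total_animals,
    }


# Hypothetical cohort of four approved studies (illustrative numbers only).
cohort = [
    Study([Experiment(10, True), Experiment(30, False), Experiment(20, False)]),
    Study([Experiment(15, True), Experiment(25, False)]),
    Study([Experiment(40, False)]),
    Study([Experiment(20, True), Experiment(40, False)]),
]

rates = non_publication_rates(cohort)
for level, rate in rates.items():
    print(f"{level} level: {rate:.1%} unreported")
```

In this invented cohort, only one of four studies is entirely unpublished, yet five of eight experiments and 155 of 200 animals go unreported. The study-level rate counts a study as published as soon as any one of its experiments appears in the literature, which is why it is the lowest of the three here.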
The goal of this study was to explore qualitatively the measurement challenges and causes of incomplete reporting of the results of animal studies, differentiating between three levels of measurement: (1) overall study as approved by regulators, (2) discrete experiments nested within the study, and (3) individual animals used within experiments.

Methods
This study is reported in line with the Consolidated criteria for reporting qualitative research (COREQ) guideline.
We used purposive sampling to gain perspectives from different animal researchers and other stakeholders. The primary purpose was to obtain as complete a picture as possible of the different causes of incomplete results reporting in animal research and the opportunities and challenges in measuring incomplete reporting. Within the interviews we introduced three levels of analysis for results publication: study, experiment, and subjects. See Table 1 for examples and for how these three levels of analysis compare with those in clinical research.
Due to the sensitivity of the subject matter, initial participants were recruited from the study team's professional networks, with further participants recruited via snowballing. As our study team focuses on responsible research, this likely biased our sample towards respondents with a high awareness of issues relevant for incomplete reporting. Interviewees were offered 150 Euros to compensate them for the time required to participate in the study. Following a purposive and iterative sampling strategy, we recruited 18 interviewees (from 26 contacted, response rate 69%) until we reached thematic saturation of mentioned topics. While respondents were drawn from multiple regions in Germany and multiple levels of seniority, ranging from postdoctoral researchers to senior academics, we did not attempt to recruit a representative sample but aimed for a purposive sample with diverse backgrounds and perspectives on the topic of comprehensive reporting. While governmental competencies might differ in some details across German federal states, the overarching regulatory requirements and the concept of animal studies incorporating several animal experiments are the same throughout Germany. All interviewees met our inclusion criterion of having experience in conducting, supervising and/or publishing the findings of animal research (S1 File).
All participants were asked to sign a consent form that outlined the basic research questions, informed participants that interviews would be recorded and transcribed, and that their anonymity would be safeguarded. Participants consented to selected quotes from interviews being cited verbatim in a future publication (S2 File). All interviews were conducted via video call by the same member of the study team (TB), in two cases in conjunction with another team member (ND or UT), based on a written interview guide that had been developed by the team of authors in an iterative process (S3 File). Because the team of authors includes five persons with a background in animal research, we did not conduct pilot interviews. The lead interviewer is a German postdoctoral researcher with extensive experience in conducting qualitative research, including on publication bias, but with no personal experience of conducting animal research; his professional background was disclosed verbally at the outset of each interview. All interviews were between 50 and 70 minutes in length, with a mean length of 60 minutes.

Table 1 (fragment). AS, individual animal: non-reporting of outcomes for some animals.
All interviews were transcribed by a professional transcription company that had signed a non-disclosure agreement. The lead interviewer (TB) reviewed all transcripts and manually grouped responses into thematic categories, initially broadly mirroring key items in the interview guide and subsequently further sub-categorising them until thematic saturation was reached as per the criteria elaborated by Fusch and Ness [17]. Further team members (SW, NT, UT, DS) reviewed and commented on the categorisation; disputes were resolved by consensus. Quotes cited in the paper (as "Q99") were selected by eliminating duplicate quotes on the same topic until arriving at the quote or quotes that best summarised the tenor of all responses received on that issue. Quotes were translated by a bilingual researcher (TB) and are available in German and English (S4 File). The interview transcripts, slightly redacted to further safeguard the anonymity of participants, were archived on a password-protected server at the Charité, Berlin, Germany.
The study was preregistered on the Open Science Framework (https://osf.io/34qny/) and was approved by the ethics committee of the Medizinische Hochschule Hannover, Hannover, Germany (number 9504_BO_K_2020).

Interviews
We conducted confidential semi-structured interviews with 18 experts (14 animal researchers, 2 methodology experts, 1 journal editor, 1 industry group representative; 16 located in Germany and 2 in the UK) during May-June 2021. The main categories identified in the interviews became increasingly saturated after approximately 10 interviews. While the next eight interviews provided further perspectives on sub-categories and particularities, no new major categories emerged. While thematic saturation was reached for main categories, we may not have captured minor or rare factors.
Quotes exemplifying the themes presented in the following sections are displayed in Table 2 and more exhaustively in S4 File.

Measurement challenges of incomplete reporting
Incomplete reporting of results can take place on three levels. Researchers can decide to not report the results of an entire study, or of discrete experiments nested within each study, or of individual animals used (see Table 1).
Study level tracking challenges. Respondents concurred that generating animal study cohorts from regulatory approvals (including amendments) and then searching the literature for related publications is an appropriate way to measure incomplete reporting at the study level, albeit with three caveats.
First, some commercial and non-commercial funders require approvals to be in place before reviewing funding applications, and some studies that receive regulatory approval subsequently fail to secure funding (Q1). Second, there is a substantial delay between filing a project evaluation (Tierversuchsantrag) with regional German authorities and receiving authorization; multiple respondents cited nine months as a typical time span, though this may vary by federal region (Bundesland) and individual study. Turnover of staff (Q2) or new scientific developments (Q3) during that waiting period may lead a study team to decide not to initiate an authorized project. Third, literature searches for animal study results face substantial challenges (see further below).
Experiment level tracking challenges. Respondents concurred that in the German context, application documents for project evaluations (Tierversuchsantraege) by themselves cannot be used to meaningfully and reliably measure incomplete publication at the experiment level. Several respondents highlighted that the German project evaluation and authorization system requires all experiments to be specified in great detail many months before work on a study begins, while the exploratory nature of their research requires flexibility to modify the study design as work progresses and new insights emerge.
Respondents concurred that German researchers routinely seek to maintain this flexibility by crafting applications that incorporate a very wide range of possible experiments and a large number of animals, to cover possible future contingencies (Q4, Q5, Q6, Q7, Q8, Q50).
In vitro studies or early stage experiments often show an initially envisaged line of enquiry to be futile, or required compounds or materials cannot be secured. When this happens, the originally planned experiments are never performed (Q9, Q10).
Conversely, when a line of enquiry appears fruitful, new research questions may emerge. To be able to address those, researchers file requests to modify the original approvals (Aenderungsantraege) by replacing predefined experiments with new ones, while keeping animal numbers constant. This way, an original application may be modified dozens of times (Q11). However, if discrete experiments are never performed, this is rarely reported back to the authorities (Q12).
Measuring incomplete reporting at the experiment level thus requires taking the original approval as the starting point, and then working sequentially through all subsequent Aenderungsantraege. While time consuming (Q13), this can be used to establish an upper limit for the number of experiments that might have been performed, but not the precise number of experiments actually performed.
Animal level tracking challenges. Similarly, researchers may terminate experiments early, using fewer animals than planned (Q14, Q15), but such reductions are rarely reported back to the authorities (Q16, Q17). Therefore, applications for animal research (Tierversuchsantraege) in conjunction with subsequent modification requests (Aenderungsantraege) can be used to establish an upper limit for the number of animals that might have been used, but not the precise number of animals actually used.

Table 2 (excerpt). Quotes exemplifying the themes.

Lack of incentives to report negative and null results
Hard to publish high impact. R02, Q26: If it's a single, negative result, and you have another 20 papers to publish with positive results, then it is very likely which you will tackle first, because they promise a high impact and so on. (. . .) Do I aim at one big publication, to become visible, or do I-in inverted commas-"waste" time for the publication of negative results that possibly benefit other people in decades down the line? You can't burden young ECRs with that. But at the end of the day, it's them who have to generate the data and do the groundwork.
R02, Q29: Regarding the publication of negative data, it depends on how confrontational or spectacular a result is. If for example a certain mechanism was postulated for decades and now it is shown in an animal study that that is not the case, then you can surely publish such negative data very prominently.

Pressures to deliver positive results
Selective reporting. R08, Q33: I think that's the most frequent kind of fault, that animals are excluded. (. . .) I don't stand behind every person using a pipette. But I believe that of course things like that happen. People are under pressure, they need a job, else they don't have anything to do-there's no need to deceive ourselves. And I believe that that can also lead to wrong ["falschen"] human trials.

(. . .) And when it isn't expedient, then it gets swept beneath the carpet rather than somehow being brought into connection with a product.

Reputational concerns
May suggest study was flawed. R01, Q43: Was that a thought through hypothesis, or was that from the outset a hypothesis that cannot work at all?
Drop-out rates can reflect on competence. R15, Q45: People don't want to say that their surgery is only effective 50 percent of the time (. . .) They want to give this impression that everything works all the time, but science is messy.

Socio-political and regulatory pressures
Stigma and external pressure. R06, Q49: I think we need to get away from this, 'I am being controlled or even punished'. Sometimes it's scary. If you try to follow the rules, you actually are more afraid that someone could point the finger at you, because there is suddenly something to control. But if I don't document anything, I run less danger.

PLOS ONE

Literature matching challenges. In human clinical trials, a single journal article typically describes the outcomes of a single trial. Tracing outcome publications for animal studies is far more challenging because a scientific research project can involve multiple applications for animal research (Tierversuchsantraege), and multiple experiments nested within several applications may later be recombined into a single scientific paper (Q18, Q19). In addition, some outcomes may only get published within doctoral theses or in other formats, or be kept on file indefinitely until they can be 'fitted' into a future broader publication (Q20, Q21).
Causes of incomplete reporting. According to respondents, there are six drivers of incomplete reporting of results in animal research: a lack of incentives to report non-positive results, pressures to 'deliver' positive results, perceptions that some data do not add value, commercial pressures, reputational concerns, and socio-political and regulatory pressures.
Lack of incentives to report negative and null results. Respondents unanimously concurred that the lack of incentives for academic researchers to publish 'negative' or 'null' findings is a major driver of non-reporting at all levels: project, experiment and individual animal (Q22). High impact journals that are crucial to academics' career progression and ability to attract future funding are commonly not interested (Q51) in publishing non-positive findings (commonly defined through p-values, Q23) or replications (Q28), regardless of methodological rigour or scientific merit (Q24). Some respondents mentioned that if a paper on a 'positive' project includes non-positive results for discrete experiments, editors or reviewers often remove these (Q25). Publication in lower impact journals is unattractive, mainly due to opportunity costs (Q26), but also because achieving tenure can hinge on a researcher having a high impact average across all publications (Q27).
However, some respondents noted that non-positive findings can be published in high impact journals if they refute previous landmark findings in the field (Q29). While the evidence bar may be set higher in such cases, such papers can later attract many citations (Q31).
Pressures to deliver positive results. Career pressures to deliver clearly 'positive' results can drive some researchers to omit the data for some animals in journal articles (Q32). Respondents believed that such selective reporting is not uncommon (Q33, Q34).
Perceptions that some data do not add value. Furthermore, some respondents thought that reporting some data was unnecessary as it would not add any scientific value (Q52).
Examples cited included data ruined by laboratory accidents (Q35), pre-intervention and pre-measurement dropouts (Q36, Q37), unexplained failures due to unknown variables (Q38), and experiments terminated after only very few animals were used (Q39).
Commercial pressures. When studies are funded by commercial entities, funders may sometimes object to the publication of results because they are viewed as commercially confidential (Q42) or because they reflect negatively on the product being tested (Q41).
Reputational concerns. Some respondents also pointed out that an absence of 'positive' results could indicate that a study was badly conceived (Q43), and that having high drop-out rates pre-experiment could be interpreted as a lack of skills (such as surgical skills) by an individual or study team (Q44, Q45), potentially exposing researchers to criticism even when a study was well designed and implemented.
Socio-political and regulatory pressures. Such reputational concerns are compounded by legal and reputational pressures. Widespread public and political animosity towards animal research in Germany (Q46, Q47) and close monitoring by activist groups (Q48, Q49) can disincentivise the sharing of failures and 'negative' results.

Measuring incomplete reporting
Generating meaningful and reliable data on the extent of incomplete publication of biomedical animal studies is challenging in the German context. At the study level, it requires verifying that approved studies were actually initiated post-approval and taking into account complex publication pathways. At the experiment level and animal level, it requires analysing approvals plus numerous modification requests (Aenderungsantraege). This time consuming methodology can establish an upper bound for the numbers of experiments performed and/or animals used, but will typically not capture post-approval reductions in the numbers of experiments or animals. To generate precise data on experiments performed and/or animals used, additional data, for example from laboratory notebooks, specific documentation of animal research facilities, or other sources, would be required.
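The upper-bound logic described above can be sketched as follows. This is a hypothetical illustration rather than a procedure used in this study: the record layout, the names `Amendment`, `approved_state` and `upper_bounds`, and all numbers are assumptions. The sketch further assumes that an experiment replaced through a modification request was not performed before it was replaced.

```python
from dataclasses import dataclass, field


@dataclass
class Amendment:
    """One modification request (hypothetical record layout)."""
    removed: set = field(default_factory=set)   # experiment labels dropped
    added: dict = field(default_factory=dict)   # new label -> approved animals


def approved_state(original, amendments):
    """Apply all amendments, in order, to the original approval."""
    state = dict(original)
    for amendment in amendments:
        for label in amendment.removed:
            state.pop(label, None)
        state.update(amendment.added)
    return state


def upper_bounds(original, amendments):
    """Upper bounds for experiments performed and animals used.

    Assumption: replaced experiments were never performed. Early
    terminations and unperformed experiments are not visible here,
    so the true numbers may be lower.
    """
    state = approved_state(original, amendments)
    return len(state), sum(state.values())


# Illustrative approval: three experiments, 150 animals in total.
original = {"E1": 40, "E2": 60, "E3": 50}
amendments = [
    Amendment(removed={"E2"}, added={"E4": 60}),
    Amendment(removed={"E3"}, added={"E5": 20, "E6": 30}),
]

max_experiments, max_animals = upper_bounds(original, amendments)
print(max_experiments, max_animals)

# If a literature search finds outcomes for 45 animals (hypothetical), the
# unreported share relative to the upper bound is itself only an upper bound.
reported_animals = 45
max_unreported_share = 1 - reported_animals / max_animals
```

Because post-approval reductions are rarely reported back to the authorities, the number of animals actually used may be lower than `max_animals`. A non-reporting share computed against this upper bound can therefore overstate the true share, and laboratory notebooks or facility documentation would be needed to narrow it.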

Causes of incomplete reporting
Respondents flagged six drivers of incomplete reporting of results in biomedical animal research. Four of these drivers-lack of incentives to report certain results, pressures to 'deliver' positive results, perceptions that some data do not add value, and commercial pressures-closely match drivers for incomplete reporting in other areas of research [18][19][20][21][22]. The fifth driver-reputational concerns-may play a far greater role in animal research than in human drug trials, possibly because investigators' technical skills can affect animal survival rates more directly.
The sixth driver of incomplete reporting-socio-political and regulatory pressures-may be a specific feature of animal studies. The lack of social and political consensus in Germany on the desirability of conducting such research in the first place, combined with the vigilance of advocacy groups, generates an environment that discourages reporting the results of 'failed experiments' involving animals. In contrast, there is overwhelming social and political consensus that running well-designed clinical trials in humans is desirable, and a tacit understanding that clinical equipoise dictates that some participants in clinical trials may fail to experience benefits or even suffer harms.
The finding that reputational concerns are a strong driver of incomplete reporting in this field may merit further research. For example, surgeons' skill and experience can affect the success rates for surgery [23,24]. Future research could explore whether reputational concerns influence the reporting of clinical trials of surgical interventions.

Publicly available information about individual animal experiments
Incomplete reporting of animal studies can be reliably quantified at the study level using approval documentation in Europe, as two previous studies have already done [13,14]. Individual animal studies, however, mostly comprise several experiments with hundreds of animals. The reporting rate at the study level, therefore, is conceptually flawed as it only captures whether any results of any experiment with any animals were reported. For a more meaningful understanding of the extent of incomplete reporting in animal research, the measurement of results reporting at the level of individual experiments or individual animals is needed.
Efforts at precise quantification at the experiment level and animal level, however, would require additional data about which animal experiments are ultimately conducted. This information is currently not accessible for systematic evaluations. One potential source for this kind of information could be local or national documentation of the characteristics of all animal experiments started and completed at university-based animal research facilities. The comprehensive preregistration of individual animal experiments could also facilitate measurements of publication rates, as demonstrated for clinical research [25,26]. Preregistration of animal studies, however, is still in its infancy [27,28].

Limitations
This study has two limitations. First, there may be additional factors contributing to incomplete reporting in animal research that were not identified by respondents. The interview guide asked the open-ended question of what the most common causes of incomplete reporting were, and we have reported all causes flagged by respondents. However, it is possible that additional, less common causes were not flagged during 18 hours of interviews. Second, it is unclear whether and to what extent our findings are generalizable beyond the specific context of animal research conducted within Germany. EU countries conducting animal research under the directive 2010/63/EU might experience similar challenges in measuring publication rates if they apply a similar study application and approval system that integrates different animal experiments under the umbrella of one animal study.

Concept for complete results reporting in animal research
Our research indicates that the concept of complete reporting in animal research remains contested and underdefined. While reporting guidelines such as ARRIVE (Animal Research: Reporting of In Vivo Experiments) [29] address reporting quality, further guidance is needed that specifies which data from animal studies require reporting to guarantee an unbiased knowledge gain and which data do not merit reporting in this regard. Furthermore, which dissemination routes qualify as appropriate results reporting? Several journals, such as PLOS ONE or BMJ Open Science, explicitly invite the submission of "negative" or "undesired" results. Other journals should follow this example to facilitate unbiased results reporting. Besides peer-reviewed journal articles, preprints, data repositories, or summary results in publicly accessible registries/databases might also become important formats for results dissemination. In clinical trials, for example, the reporting of summary results in trial registries has become a broadly accepted alternative to journal publication.
Future efforts to improve results reporting in animal studies should take into account socio-political pressures because in some contexts these can be significant factors discouraging the reporting of pre-experimental dropout rates and 'null' and 'negative' outcomes. More incentives and rewards, including career incentives, for complete results reporting in animal research might help to improve the status quo. Similar to the development of reporting guidelines [29,30] or other guidelines from the Laboratory Animal Science Association (LASA) [31], the relevant stakeholder groups in animal research, including animal researchers, animal research facilities, funders, expert networks, and regulators, should work together to develop guidance and best practice standards for comprehensive results reporting. Once such guidance is available, academic institutions and regulators should facilitate valid measurement of the extent and consequences of incomplete reporting in animal research.